Abstract. In genome rearrangement theory, one of the elusive questions
raised in recent years is the enumeration of rearrangement scenarios be- 
tween two genomes. This problem is related to the uniform generation of 
rearrangement scenarios, and the derivation of tests of statistical significance
of the properties of these scenarios. Here we give an exact formula
for the number of double-cut-and-join (DCJ) rearrangement scenarios of 
co-tailed genomes. We also construct effective bijections between the set 
of scenarios that sort a cycle and well studied combinatorial objects such 
as parking functions and labeled trees. 

1 Introduction 

Sorting genomes can be succinctly described as finding sequences of rearrangement
operations that transform one genome into another. The allowed rearrangement
operations are fixed, and the sequences of operations, called sorting scenarios,
are ideally of minimal length. Given two genomes, the number of different
sorting scenarios between them is typically huge - we mean HUGE - and very
few analytical tools are available to explore these sets.

In this paper, we give the first exact results on the enumeration and rep- 
resentation of sorting scenarios in terms of well-known combinatorial objects. 
We prove that sorting scenarios using DCJ operations on co-tailed genomes can 
be represented by parking functions and labeled trees. This surprising connec- 
tion yields immediate results on the uniform generation of scenarios [1,13,18], 
promises tools for sampling processes [6, 10, 12] and the development of statistical
significance tests [7, 11, 16, 21], and offers a wealth of alternate representations
to explore the properties of rearrangement scenarios, such as commutation [4, 
20], structure conservation [3,8], breakpoint reuse [15, 17] or cycle length [22]. 

This research was initiated while we were trying to understand commuting 
operations in a general context. In the case of genomes consisting of single chro- 
mosomes, rearrangement operations are often modeled as inversions, which can 
be represented by intervals of the set {1,2,..., n}. Commutation properties are 
described by using overlap relations on the corresponding sets, and a major tool
for understanding sorting scenarios is the overlap graph, whose vertices represent single
rearrangement operations, and whose edges model the interactions between
the operations. Unfortunately, overlap graphs do not upgrade easily to genomes 
with multiple chromosomes, see, for example, [14], where a generalization is given 
for a restricted set of operations. 

We gained significant insights when we switched our focus from single rearrangement
operations to complete sorting scenarios. This apparently more complex
formulation offers the possibility to capture complete scenarios of length d as
simple combinatorial objects, such as sequences of integers of length d, or trees
with d vertices. It also gives alternate representations of sorting scenarios, using 
non-crossing partitions, that facilitate the study of commuting operations and 
structure conservation. 

In Section 3, we first show that sorting a cycle in the adjacency graph of 
two genomes with DCJ rearrangement operations is equivalent to refining non- 
crossing partitions. This observation, together with a result by Richard Stanley 
[19], gives the existence of bijections between sorting scenarios of a cycle and 
parking functions or labeled trees. We give explicit bijections for both in Sec- 
tions 4 and 5. We conclude in Section 6 with remarks on the usefulness of these 
representations, on the algorithmic complexity of switching between representa- 
tions, and on generalizations to genomes that are not necessarily co-tailed. 

2 Preliminaries 

Genomes are compared by identifying homologous segments along their DNA 
sequences, called blocks. These blocks can be relatively small, such as gene coding
sequences, or very large fragments of chromosomes. The order and orientation
of the blocks may vary in different genomes. Here we assume that the two
genomes consist of either circular chromosomes, or co-tailed linear chromosomes. 
For example, consider the following two genomes, each consisting of two linear 
chromosomes: 

Genome A: (a -f -b e -d) (-c g)
Genome B: (a b c) (d e f g)

The set of tails of a linear chromosome $(x_1 \ldots x_m)$ is $\{x_1, -x_m\}$, and two
genomes are co-tailed if the unions of their sets of tails are the same. This is the
case for genomes A and B above, since the union of their sets of tails is
{a, -c, d, -g}.

An adjacency in a genome is a sequence of two consecutive blocks. For ex- 
ample, in the above genomes, (e -d) is an adjacency of genome A, and (a b) is 
an adjacency of genome B. Since a whole chromosome can be flipped, we always 
have (x y) = (-y -x).

The adjacency graph of two genomes A and B is a graph whose vertices are
the adjacencies of A and B, and such that for each block y there is an edge
between adjacency (y z) in genome A and (y z') in genome B, and an edge
between (x y) in genome A and (x' y) in genome B. See, for example, Figure 1.

Fig. 1. At the left, the adjacency graph of genome A = (a -f -b e -d) (-c g) and
genome B = (a b c) (d e f g) is represented by dotted lines. The sign of a block
is represented by the orientation of the corresponding arrow. If the - single - cycle
is traversed starting with an arbitrary adjacency of genome B, here (a b), in the
direction of the small arrow, then the 5 adjacencies of genome B will be visited in the
order indicated by the numbers 1 to 5. At the right, the cycle has been spread out,
showing that any DCJ operation acting on two adjacencies of genome A that splits the
cycle can be represented by two cuts on the cycle (12345).

Since each vertex has two incident edges, the adjacency graph can be decomposed
into connected components that are cycles. The graph of Figure 1 has a
single cycle of length 10. 

A double-cut-and-join (DCJ) rearrangement operation [5, 23] on genome A 
acts on two adjacencies (x y) and (u v) to produce either (x v) and (u y), or
(x -u) and (-y v). In simpler words, a DCJ operation cuts the genome at two
places, and glues the parts back together in a different order.
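
The two possible outcomes of a DCJ operation can be written down mechanically from the definition above. The following Python sketch is only an illustration: it assumes that a block is encoded as a string such as "e" or "-d" and an adjacency as a pair of blocks, and the names negate and dcj_outcomes are hypothetical.

```python
def negate(block):
    """Flip the sign of a block written as 'x' or '-x' (assumed encoding)."""
    return block[1:] if block.startswith("-") else "-" + block


def dcj_outcomes(adj1, adj2):
    """Return the two possible results of a DCJ on adjacencies (x y) and (u v)."""
    (x, y), (u, v) = adj1, adj2
    return [((x, v), (u, y)),                    # (x v) and (u y)
            ((x, negate(u)), (negate(y), v))]    # (x -u) and (-y v)


# Adjacencies (e -d) and (-c g) of genome A from the example above.
for outcome in dcj_outcomes(("e", "-d"), ("-c", "g")):
    print(outcome)
```

On the adjacencies (e -d) and (-c g), the sketch yields either (e g) and (-c -d), or (e c) and (d g). It does not check whether an outcome is a sorting operation.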

The distance between genomes A and B is the minimum number of DCJ
operations needed to rearrange - or sort - genome A into genome B. The DCJ 
distance is easily computed from the adjacency graph [5]. For circular chromo- 
somes or co-tailed genomes, the distance is given by: 



d(A, B) = N - (C + K)



where N is the number of blocks, C is the number of cycles of the adjacency
graph, and K is the number of linear chromosomes in A. Note that K is a
constant for co-tailed genomes. A rearrangement operation is sorting if it lowers
the distance by 1, and a sequence of sorting operations of length d(A, B) is called
a parsimonious sorting scenario. It is easy to detect sorting operations since, by
the distance formula, a sorting operation must increase by 1 the number of cycles.
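
As an illustration of the distance formula, the following Python sketch counts the cycles of the adjacency graph with a union-find structure and evaluates N - (C + K) on the two genomes of Section 2. It is a minimal sketch under stated assumptions: genomes are lists of linear chromosomes, blocks are strings such as "a" or "-f", the two genomes are co-tailed, and circular chromosomes are not handled. All function names are hypothetical.

```python
def adjacencies(genome):
    """Return each adjacency as a frozenset of two block extremities."""
    def ends(block):
        # (left extremity, right extremity) of a signed block
        name = block.lstrip("-")
        tail, head = (name, "t"), (name, "h")
        return (head, tail) if block.startswith("-") else (tail, head)

    result = []
    for chromosome in genome:
        for x, y in zip(chromosome, chromosome[1:]):
            result.append(frozenset((ends(x)[1], ends(y)[0])))
    return result


def count_cycles(genome_a, genome_b):
    """Count the connected components (cycles) of the adjacency graph."""
    adj_a, adj_b = adjacencies(genome_a), adjacencies(genome_b)
    # Map each extremity to the adjacency of B that contains it.
    owner_b = {e: j for j, adj in enumerate(adj_b) for e in adj}
    parent = list(range(len(adj_a) + len(adj_b)))  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for i, adj in enumerate(adj_a):
        for e in adj:
            # Edge between A-adjacency i and the B-adjacency sharing e.
            parent[find(i)] = find(len(adj_a) + owner_b[e])
    return len({find(v) for v in range(len(parent))})


def dcj_distance(genome_a, genome_b):
    """Evaluate d(A, B) = N - (C + K) for co-tailed linear genomes."""
    n_blocks = sum(len(chromosome) for chromosome in genome_a)
    return n_blocks - (count_cycles(genome_a, genome_b) + len(genome_a))


genome_a = [["a", "-f", "-b", "e", "-d"], ["-c", "g"]]
genome_b = [["a", "b", "c"], ["d", "e", "f", "g"]]
print(count_cycles(genome_a, genome_b), dcj_distance(genome_a, genome_b))
```

For these genomes the sketch finds the single cycle of Figure 1 and, with N = 7 and K = 2, a distance of 4.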

A DCJ operation that acts on two cycles of the adjacency graph will merge
the two cycles, and can never be sorting. Thus the sorting operations act on a
single cycle, and split it into two cycles. The central question of this paper is
to enumerate the set of parsimonious sorting scenarios. Since each cycle is sorted
independently of the others, the problem reduces to enumerating the sorting 
scenarios of a cycle. Indeed, we have: 

Proposition 1. Given scenarios $S_1, \ldots, S_C$ of lengths $\ell_1, \ldots, \ell_C$ that sort the
C cycles of an adjacency graph, these scenarios can be shuffled into a global
scenario in

$$\binom{\ell_1 + \ell_2 + \cdots + \ell_C}{\ell_1, \ell_2, \ldots, \ell_C} = \frac{(\ell_1 + \ell_2 + \cdots + \ell_C)!}{\ell_1! \, \ell_2! \cdots \ell_C!}$$

different ways.

Proof. Since each cycle is sorted independently, the number of global scenarios is
enumerated by counting the number of sequences that contain $\ell_m$ occurrences
of the symbol $S_m$, for $1 \le m \le C$, which is counted by a classical formula. For
each such sequence, we obtain a scenario by replacing each symbol $S_m$ by the
appropriate operation on cycle number m.
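
The counting argument of this proof can be checked on small cases. The Python sketch below compares the multinomial coefficient with a brute-force enumeration of the words that contain the required number of copies of each symbol. The lengths used are arbitrary illustrative values, not data from the paper, and the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def shuffle_count(lengths):
    """Multinomial coefficient (l_1 + ... + l_C)! / (l_1! ... l_C!)."""
    count = factorial(sum(lengths))
    for length in lengths:
        count //= factorial(length)
    return count


def shuffles(lengths):
    """Enumerate the distinct words with lengths[m] copies of symbol m."""
    word = [m for m, length in enumerate(lengths) for _ in range(length)]
    return set(permutations(word))


lengths = [2, 1, 3]  # hypothetical scenario lengths for three cycles
print(shuffle_count(lengths), len(shuffles(lengths)))
```

Both numbers equal 60 for these lengths. The brute-force enumeration is only practical for very short scenarios.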

3 Representation of scenarios as sequences of fissions

A cycle of length 2n of the adjacency graph alternates between adjacencies of 
genome A and genome B. Given a cycle, suppose that the adjacencies of genome 
B are labeled by integers from 1 to n in the order they appear along the cycle, 
starting with an arbitrary adjacency (see Fig. 1). Then any DCJ operation that 
splits this cycle can be represented by a fission of the cycle (123 . . . n), as 

(1 2 3 ... p || q ... t || u ... n)

yielding the two cycles:

(1 2 3 ... p u ... n) and (q ... t).

We will always write cycles beginning with their smallest element. Fissions ap- 
plied to a cycle whose elements are in increasing order always yield cycles whose 
elements are in increasing order. A fission is characterized by two cuts, each 
described by the element at the left of the cut. The smallest one, p in the above 
example, will be called the base of the fission, and the largest one, t in the
above example, is called the top of the fission. The integer at the right of the 
first cut, q in the example, is called the partner of the base. 
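
A fission is easy to simulate once a cycle is stored as an increasing tuple. The following Python sketch assumes that representation, with a fission given by its base and its top. The function name fission and its return convention are hypothetical choices made for illustration.

```python
def fission(cycle, base, top):
    """Split an increasing cycle by cutting after `base` and after `top`.

    `cycle` is a tuple of integers in increasing order, and `base` < `top`
    must both belong to it. Returns (outer, inner, partner), where `inner`
    is the extracted cycle (q ... t) and `partner` is its first element q.
    """
    i, j = cycle.index(base), cycle.index(top)
    if not i < j:
        raise ValueError("the base must precede the top in the cycle")
    inner = cycle[i + 1 : j + 1]
    outer = cycle[: i + 1] + cycle[j + 1 :]
    return outer, inner, inner[0]


outer, inner, partner = fission((1, 2, 3, 4, 5, 6, 7, 8, 9), base=4, top=5)
print(outer, inner, partner)
```

With base 4 and top 5 on the cycle (123456789), the sketch returns the cycles (12346789) and (5), with partner 5.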



In general, after the application of k fissions on (123 . . .n), the resulting set 
of cycles will contain k + 1 elements. The structure of these cycles forms a non-crossing
partition of the initial cycle (123 ... n). Namely, we have the following
result, which is easily shown by induction on k: 

Proposition 2. Let k ≤ n - 1 fissions be applied on the cycle (123 ... n), then
the k + 1 resulting cycles have the following properties:

1) The elements of each cycle are in increasing order, up to cyclical reordering.

2) [Non-crossing property] If (c ... d) and (e ... f) are two cycles with c < e, then
either d < e, or c < e ≤ f < d.

3) Each successive fission refines the partition of (123 ... n) defined by the cycles.

A sorting scenario of a cycle of length 2n of the adjacency graph can thus be
represented by a sequence of n — 1 fissions on the cycle (123 ... n), called a fission 
scenario, and the resulting set of cycles will have the structure (1)(2)(3) . . . (n). 
For example, here is a possible fission scenario of (123456789); the base of each
fission is the element immediately to the left of the first cut:

(1234||5||6789) -> (12346789)(5)
(1234678||9||)(5) -> (1234678)(5)(9)
(1||234678||)(5)(9) -> (1)(234678)(5)(9)
(1)(2||346||78)(5)(9) -> (1)(278)(346)(5)(9)
(1)(2||7||8)(346)(5)(9) -> (1)(28)(346)(5)(7)(9)
(1)(28)(3||46||)(5)(7)(9) -> (1)(28)(3)(46)(5)(7)(9)
(1)(2||8||)(3)(46)(5)(7)(9) -> (1)(2)(3)(46)(5)(7)(8)(9)
(1)(2)(3)(4||6||)(5)(7)(8)(9) -> (1)(2)(3)(4)(5)(6)(7)(8)(9)

Scenarios such as the one above have interesting combinatorial features when 
all the operations are considered globally, and we will use them extensively in 
the sequel. A first important remark is that the smallest element of the cycle is 
always 'linked' to the greatest element through a chain of partners. For example, 
the last partner of element 1 is element 2, the last partner of element 2 is element 
8, and the last partner of element 8 is element 9. We will see that this is always 
the case, even when the order of the corresponding fissions is arbitrary with 
respect to the scenario. The following definition captures this idea of chain of 
partners. 

Definition 1. Consider a scenario S of fissions that transform a cycle (c ... d)
into cycles of length 1. For each element p in (c ... d), if p is the base of one or
more of the fissions of S, let q be the last partner of p, then define recursively

$$\mathrm{Sup}_S(p) = \mathrm{Sup}_S(q),$$

otherwise, $\mathrm{Sup}_S(p) = p$.



In order to see that $\mathrm{Sup}_S(p)$ is well defined, first note that the successive
partners of a given base p are always in increasing order, and greater than p.
Moreover, the last element of a cycle (c ... d) is never the base of a fission. For
example, in the above scenario, we would have
$\mathrm{Sup}_S(1) = \mathrm{Sup}_S(2) = \mathrm{Sup}_S(8) = \mathrm{Sup}_S(9) = 9$.
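
The recursive definition amounts to following the chain of last partners until an element that is not a base is reached. The following Python sketch illustrates this, assuming that a scenario is stored as a list of (base, partner) pairs in the order in which the fissions are applied. The function name sup is hypothetical.

```python
def sup(pairs, p):
    """Follow the chain of last partners starting from element p.

    `pairs` lists the (base, partner) pairs of the fissions of a scenario,
    in the order in which the fissions are applied.
    """
    last_partner = {}
    for base, partner in pairs:
        last_partner[base] = partner  # later fissions overwrite earlier ones
    # The loop terminates because a partner is always greater than its base.
    while p in last_partner:
        p = last_partner[p]
    return p


# (base, partner) pairs of the fission scenario of (123456789) given above.
pairs = [(4, 5), (8, 9), (1, 2), (2, 3), (2, 7), (3, 4), (2, 8), (4, 6)]
print([sup(pairs, p) for p in (1, 2, 8, 9)])
```

On the scenario above, the chain 1, 2, 8, 9 gives the value 9 for each of these four elements.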

The following lemma is the key to most of the results that follow: 

Lemma 1. Consider a scenario S of fissions that transform a cycle (c ... d) into
cycles of length 1, then $\mathrm{Sup}_S(c) = d$.

Proof. If c = d, then the result is trivial. Suppose the result is true for cycles of
length ≤ n, and consider a cycle of length n + 1. The first fission of S will split the
cycle (c ... d) in two cycles of length ≤ n. If the two cycles are of the form (c ... d)
and (c' ... d'), then c' is, in the worst case, the first partner of c, and cannot be
the last since c ≠ d. Let S' be the subset of S that transforms the shorter cycle
(c ... d) into cycles of length 1. By the induction hypothesis, $\mathrm{Sup}_{S'}(c) = d$, but
$\mathrm{Sup}_S(c) = \mathrm{Sup}_{S'}(c)$ since the last partner of c is not in (c' ... d').

If the two cycles are of the form (c ... d') and (c' ... d), consider $S_1$ the subset
of S that transforms the cycle (c ... d') into cycles of length 1, and $S_2$ the subset
of S that transforms the cycle (c' ... d) into cycles of length 1. We have, by the
induction hypothesis, $\mathrm{Sup}_{S_1}(c) = d'$ and $\mathrm{Sup}_{S_2}(c') = d$, implying
$\mathrm{Sup}_S(c) = \mathrm{Sup}_S(d')$ and $\mathrm{Sup}_S(c') = d$. However, c' is the last partner of d', thus
$\mathrm{Sup}_S(c) = \mathrm{Sup}_S(d') = \mathrm{Sup}_S(c') = d$.

4 Fission scenarios and parking functions 

In this section, we establish a bijection between fission scenarios and parking 
functions of length n - 1. This yields a very compact representation of DCJ
sorting scenarios of cycles of length 2n as sequences of n - 1 integers.

A parking function is a sequence of integers $p_1 p_2 \ldots p_{n-1}$ such that if the
sequence is sorted in non-decreasing order yielding $p'_1 \le p'_2 \le \cdots \le p'_{n-1}$, then
$p'_i \le i$. These sequences were introduced by Konheim and Weiss [9] in connection
with hashing problems. These combinatorial structures are well studied, and the
number of different parking functions of length n - 1 is known to be $n^{n-2}$.
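
The definition translates directly into a test on the sorted sequence. The following Python sketch checks the condition and counts parking functions of small length by brute force. It assumes that the entries are positive integers, which the definition leaves implicit, and the function names are hypothetical.

```python
from itertools import product


def is_parking_function(seq):
    """Check that the i-th smallest entry is at least 1 and at most i."""
    return all(1 <= value <= rank
               for rank, value in enumerate(sorted(seq), start=1))


def count_parking_functions(length):
    """Brute-force count over all sequences with entries in 1..length."""
    values = range(1, length + 1)
    return sum(is_parking_function(s) for s in product(values, repeat=length))


# Sequence of bases of the fission scenario of Section 3.
print(is_parking_function([4, 8, 1, 2, 2, 3, 2, 4]))
print([count_parking_functions(m) for m in (1, 2, 3, 4)])
```

The counts obtained for lengths 1 to 4 are 1, 3, 16 and 125, in agreement with the formula $n^{n-2}$ for length n - 1.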

Proposition 2 states that a fission scenario is a sequence of successively refined 
non-crossing partitions of the cycle (123 . . . n). A result by Stanley [19] has the 
following immediate consequence: 

Theorem 1. There exists a bijection between fission scenarios of cycles of the 
form (123 ... n) and parking functions of length n - 1.

Fortunately, in our context, the bijection is very simple: we list the bases of
the fissions of the scenario. For example, the parking function associated to the 
example of Section 3 is 48122324. In general, we have: 

Proposition 3. The sequence of bases of a fission scenario on the cycle (123 . . . n) 
is a parking function of length n - 1.



Proof. Let $p_1 p_2 \ldots p_{n-1}$ be the sequence of bases of a fission scenario and let
$p'_1 p'_2 \ldots p'_{n-1}$ be the corresponding sequence sorted in non-decreasing order. Suppose
that there exists a number i such that $p'_i > i$, then there are at least n - i
fissions in the scenario with base p ≥ i + 1. These bases can be associated to
at most n - i - 1 partners in the set {i + 2, i + 3, i + 4, ..., n} because a base
is always smaller than its partner, but this is impossible because each integer is
used at most once as a partner in a fission scenario.

In order to reconstruct a fission scenario from a parking function, we first
note that a fission with base $p_i$ and partner $q_i$ creates a cycle whose smallest
element is $q_i$, thus each integer in the set {2, 3, ..., n} appears exactly once as a
partner in a fission scenario.

Given a parking function $p_1 p_2 \ldots p_{n-1}$, we must first assign to each base $p_i$ a
unique partner $q_i$ in the set {2, 3, ..., n}. By Lemma 1, we can then determine
the top $t_i$ of fission i, since the set of fissions from i + 1 to n - 1 contains a
sorting scenario of the cycle $(q_i \ldots t_i)$. Algorithm 1 details the procedure.

Algorithm 1 [Parking functions to fission scenarios]
Input: a parking function $p_1 p_2 \ldots p_{n-1}$.
Output: a fission scenario $(p_1, t_1), \ldots, (p_{n-1}, t_{n-1})$.

Q <- {2, 3, ..., n}
For p from n - 1 to 1 do:
    For each successive occurrence $p_i$ of p in the sequence $p_1 p_2 \ldots p_{n-1}$ do:
        $q_i$ <- The smallest element of Q greater than $p_i$
        Q <- Q \ {$q_i$}
S <- {$(p_1, q_1), (p_2, q_2), \ldots, (p_{n-1}, q_{n-1})$}
For i from 1 to n - 1 do:
    S <- S \ {$(p_i, q_i)$}
    $t_i$ <- $\mathrm{Sup}_S(q_i)$

For example, using the parking function 48122324 and the set of partners
{2, 3, ..., 9}, we would get the pairings, starting from base 8 down to base 1:

$p_i$ : 4 8 1 2 2 3 2 4
$q_i$ : 5 9 2 3 7 4 8 6

Finally, in order to recover the second cut of each fission, we compute the
values $t_i$:

$p_i$ : 4 8 1 2 2 3 2 4
$t_i$ : 5 9 8 6 7 6 8 6

For example, in order to compute $t_4$, we have S = {(2, 7), (3, 4), (2, 8), (4, 6)}, and
$\mathrm{Sup}_S(3) = \mathrm{Sup}_S(4) = \mathrm{Sup}_S(6) = 6$.

Since we know, by Theorem 1, that fission scenarios are in bijection with
parking functions, it is sufficient to show that Algorithm 1 recovers a given
scenario in order to prove that it is an effective bijection. 



Proposition 4. Given a fission scenario of a cycle of the form (123 ... n), let
$(p_i, q_i, t_i)$ be the base, partner and top of fission i. Algorithm 1 recovers uniquely
$t_i$ from the parking function $p_1 p_2 \ldots p_{n-1}$.

Proof. By Lemma 1, we only need to show that Algorithm 1 recovers uniquely
the partner $q_i$ of each base $p_i$. Let p be the largest base, and suppose that p has
j partners, then the original cycle must contain at least the elements:

(... p p+1 ... p+j ...).

We will show that p+1, ..., p+j must be the j partners of p. If this were not the
case, at least one of the j adjacencies in the sequence p p+1 ... p+j would have to be
cut in a fission whose base is smaller than p, since p is the largest base, and
this would violate the non-crossing property of Proposition 2. Thus Algorithm 1
correctly and uniquely assigns the partners of the largest base. Suppose now that
Algorithm 1 has correctly and uniquely assigned the partners of all bases greater
than p. The same argument shows that the successive partners of p must be the
smallest available partners greater than p.

Summarizing the results so far, we have: 

Theorem 2. If the adjacency graph of two co-tailed genomes has C cycles of
length $2(\ell_1 + 1), \ldots, 2(\ell_C + 1)$, then the number of sorting scenarios is given by:

$$\frac{(\ell_1 + \ell_2 + \cdots + \ell_C)!}{\ell_1! \, \ell_2! \cdots \ell_C!} \cdot (\ell_1 + 1)^{\ell_1 - 1} \cdots (\ell_C + 1)^{\ell_C - 1}$$

Each sub-scenario that sorts a cycle of length $2(\ell_m + 1)$ can be represented by a
parking function of length $\ell_m$.

Proof. Sorting a cycle of length $2(\ell_m + 1)$ can be simulated by fissions of the
cycle $(1\,2 \ldots \ell_m + 1)$, which can be represented by parking functions of length $\ell_m$.
The number of different parking functions of length $\ell_m$ is given by $(\ell_m + 1)^{\ell_m - 1}$.
Applying Proposition 1 yields the enumeration formula.
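
The formula of Theorem 2 is straightforward to evaluate. The Python sketch below takes the list of cycle lengths of an adjacency graph, which are assumed to be even integers, and returns the number of sorting scenarios. The function name is hypothetical, and the second call uses made-up cycle lengths purely for illustration.

```python
from math import factorial


def count_sorting_scenarios(cycle_lengths):
    """Evaluate the Theorem 2 count for cycles of lengths 2(l_m + 1)."""
    ells = [length // 2 - 1 for length in cycle_lengths]
    # Multinomial coefficient counting the shuffles of the sub-scenarios.
    count = factorial(sum(ells))
    for ell in ells:
        count //= factorial(ell)
    # Number of parking functions of length l_m for each cycle.
    for ell in ells:
        if ell > 0:  # a cycle of length 2 is already sorted
            count *= (ell + 1) ** (ell - 1)
    return count


print(count_sorting_scenarios([10]))        # single cycle of Figure 1
print(count_sorting_scenarios([10, 4, 2]))  # hypothetical cycle lengths
```

For the single cycle of length 10 of Figure 1, this gives 5^3 = 125 sorting scenarios.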



5 Fission scenarios and labeled trees 



Theorem 1 implies that it is possible to construct bijections between fission
scenarios and objects that are enumerated by parking functions. This is notably
the case of labeled trees on n vertices. These are trees with n vertices in which
each vertex is given a unique label in the set {0, 1, ..., n - 1}. In this section, we
construct an explicit bijection between these trees and fission scenarios of cycles
of the form (123 ... n).

Definition 2. Given a fission scenario S of a cycle of the form (123 ... n), let
$(p_i, q_i)$ be the base and partner of fission i.
The graph $T_S$ is a graph whose nodes are labeled by {0, 1, ..., n - 1}, with an
edge between i and j, if $p_i = q_j$, and an edge between 0 and i, if $p_i = 1$.



In the running example, the corresponding graph is depicted in Figure 2 (a). 
We have: 

Proposition 5. The graph $T_S$ is a labeled tree on n vertices.

Proof. By construction, the graph has n vertices labeled by {0, 1, ..., n - 1}. In
order to show that it is a tree, we will show that the graph has n - 1 edges and
that it is connected. Since each integer in the set {2, 3, ..., n} is partner of one
and only one fission in S and S contains n - 1 fissions, $T_S$ has exactly n - 1
edges. Moreover, by construction, there is a path between each vertex i ≠ 0 and
0 in $T_S$, thus $T_S$ is connected.
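
The construction of the graph of Definition 2 can be sketched in a few lines of Python. The sketch assumes that the scenario is given by its lists of bases and partners, and it orients each edge from the vertex closer to 0 to the fission it is attached to. The function name scenario_tree and this edge orientation are illustrative choices.

```python
def scenario_tree(bases, partners):
    """Edges of the graph of Definition 2, as (parent, child) pairs.

    Vertex i >= 1 stands for fission i; vertex 0 is the extra vertex joined
    to every fission whose base is 1.
    """
    # Each integer in {2, ..., n} is the partner of exactly one fission.
    fission_with_partner = {q: j for j, q in enumerate(partners, start=1)}
    edges = []
    for i, p in enumerate(bases, start=1):
        parent = 0 if p == 1 else fission_with_partner[p]
        edges.append((parent, i))
    return edges


bases = [4, 8, 1, 2, 2, 3, 2, 4]
partners = [5, 9, 2, 3, 7, 4, 8, 6]
print(scenario_tree(bases, partners))
```

For the running example, the sketch produces 8 edges on the vertices 0 to 8, with vertex 3 as the only neighbor of vertex 0.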

Before showing that the construction of $T_S$ yields an effective bijection, we
detail how to recover a fission scenario from a tree. 

Algorithm 2 [Labeled trees to fission scenarios]
Input: a labeled tree T on n vertices.
Output: a fission scenario $(p_1, t_1), \ldots, (p_{n-1}, t_{n-1})$.

Root the tree at vertex 0.
Put the children of each node in increasing order from left to right.
Label the unique incoming edge of a node with the label of the node.
Relabel the nodes from 1 to n with a prefix traversal of the tree.
For i from 1 to n - 1 do:
    $p_i$ <- The label of the source p of edge i.
    $t_i$ <- The greatest label of the subtree rooted by edge i.
    Remove edge i from T
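
The steps of Algorithm 2 can be sketched in Python as follows. The sketch assumes that the tree is already rooted at vertex 0 and given as a list of (parent, child) edges, so that the incoming edge of vertex i plays the role of edge i. Like the algorithm itself, it is a plain transcription and is not meant to be efficient. The function name is hypothetical.

```python
def scenario_from_tree(edges):
    """Recover the (base, top) pairs from a tree rooted at vertex 0."""
    n = len(edges) + 1
    children = {v: [] for v in range(n)}
    for parent, child in edges:
        children[parent].append(child)

    # Relabel the vertices from 1 to n in prefix (preorder) order, visiting
    # the children of each vertex in increasing order of their old labels.
    new_label, stack = {}, [0]
    while stack:
        v = stack.pop()
        new_label[v] = len(new_label) + 1
        stack.extend(sorted(children[v], reverse=True))

    parent_of = {child: parent for parent, child in edges}
    scenario = []
    for i in range(1, n):
        # Subtree hanging from edge i; edges 1, ..., i - 1 are already
        # removed, so only children with an old label above i are followed.
        subtree, todo = [], [i]
        while todo:
            v = todo.pop()
            subtree.append(new_label[v])
            todo.extend(c for c in children[v] if c > i)
        scenario.append((new_label[parent_of[i]], max(subtree)))
    return scenario


# Tree of the running example, rooted at 0, as (parent, child) edges.
edges = [(6, 1), (7, 2), (0, 3), (3, 4), (3, 5), (4, 6), (3, 7), (6, 8)]
print(scenario_from_tree(edges))
```

On the tree of the running example, the sketch returns the bases 4 8 1 2 2 3 2 4 and the tops 5 9 8 6 7 6 8 6, that is, the fission scenario of Section 3.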

The following proposition states that the construction of the associated tree
$T_S$ is injective, thus providing a bijection between fission scenarios and trees.

Proposition 6. The trees associated to different fission scenarios are different. 

Proof. Suppose that two different scenarios $S_1$ and $S_2$ yield the same tree T.
Then, by construction, if T is rooted in 0, for each directed edge from j to i in
T, if j = 0 then $p_i = 1$, otherwise $p_i = q_j$. Moreover, in a fission scenario, if fission
i is the first operation having base $p_i$, then its partner is $q_i = p_i + 1$, otherwise
the non-crossing property of Proposition 2 would be violated. So, using these two
properties, the sequences of bases and partners of the fissions in the two scenarios
can be uniquely recovered from T, and thus $S_1$ and $S_2$ would correspond to the
same parking function.

The tree representation offers another interesting view of the sorting proce- 
dure. Indeed, sorting can be done directly on the tree by successively erasing the 
edges from 1 to n - 1. This progressively disconnects the tree, and the resulting
connected components correspond precisely to the intermediate cycles obtained 
during the sorting procedure. 

Fig. 2. Construction of a fission scenario. (a) The unrooted tree $T_S$. (b) The tree is
rooted at vertex 0, and the children of each node are ordered. (c) The labels of the
nodes are lifted to their incoming edges. (d) The nodes are labeled in prefix order from
1 to 9. The order of the fissions is read on the edges: the source of an edge represents
the base p of the fission and its target is the partner. For example, fission #1 has base
4, with partner 5.



For example, Figure 3 gives snapshots of the sorting procedure. Part (b)
shows the forest after the first three operations. The fourth fission splits a tree
with six nodes into two trees, each with three nodes, corresponding to the cycle
splitting of the fourth operation in the running example.



6 Discussion and conclusions 

In this paper, we presented results on the enumeration and representations of 

sorting scenarios between co-tailed genomes. Since we introduced many combina- 
torial objects, we bypassed a lot of the usual material presented in rearrangement 
papers. The following topics will be treated in a future paper. 

The first topic is the complexity of the algorithms for switching between 
representations. Algorithms 1 and 2 are not meant to be efficient; they are rather
explicit descriptions of what is being computed. Preliminary work indicates that
with suitable data structures, they can be implemented in O(n) running time.
Indeed, most of the needed information can be obtained in a single traversal of 
a tree. 

The second obvious extension is to generalize the enumeration formulas and 
representations to arbitrary genomes. In the general case, when genomes are not 
necessarily co-tailed, the adjacency graph can be decomposed in cycles and paths, 
and additional sorting operations must be considered, apart from operations that 
split cycles [5]. However, these new sorting operations that act on paths create 
new paths that behave essentially like cycles. 

Fig. 3. Sorting directly on a tree: erasing successively the edges 1 to n - 1 simulates cycle
fissions by creating intermediate forests. In part (b), the fourth fission will split the
tree corresponding to cycle (234678) into two trees corresponding to the two cycles
(346) and (278).

We also had to defer to a further paper the details of the diverse uses of these
new representations. One of the main benefits of having a representation of a
sorting scenario as a parking function, for example, is that it solves the problem
of uniform sampling of sorting scenarios [2]. There is no more bias attached
to choosing a first sorting operation, since, when using parking functions, the
nature of the first operation depends on the whole scenario. The representation
of sorting scenarios as refinements of non-crossing partitions also greatly helps in
analyzing commutation and conservation properties.